
Non-record: Negative results — hardware alignment & quantization on 8xH100#670

Open

abaybektursun wants to merge 1 commit into openai:main from abaybektursun:negative-results-hardware-alignment

Conversation

@abaybektursun (Contributor) commented Mar 25, 2026

Summary

Key finding

The 82ms training step is 95%+ optimized. torch.compile (PyTorch 2.9.1) handles all fusion automatically. cuBLAS is at the hardware limit for K=512. The competition at d=512 on H100 is won by quantization quality (bits-per-parameter), not kernel engineering (FLOPS-per-second).


Kernel-Level Optimization (All Dead)

| Approach | Result | Why It Failed |
| --- | --- | --- |
| CUTLASS SM90 TMA+WGMMA GEMM | 2.5× slower than cuBLAS | cuBLAS heuristics beat default CUTLASS for 98304×512×1536. Built a working kernel: correct results, wrong speed. |
| Fused Triton GEMM + LeakyReLU² | 1.82× faster fwd, 2.7× slower fwd+bwd | `torch.autograd.Function` bypasses Inductor, so the backward runs in eager mode, 2-3× slower than Inductor's auto-generated Triton backward. |
| `torch.library.triton_op` for GEMM | Compile error | FakeTensor can't provide `data_ptr()`, so GEMM kernels are incompatible with `triton_op` tracing. |
| Custom CUDA C++ fused activation | 6% slower | PyTorch's `vectorized_elementwise_kernel` is already highly optimized for pointwise ops. |
| Fused norm+residual (Triton) | Ties torch.compile exactly | 0.136 ms ours vs 0.136 ms Inductor-generated; torch.compile already fuses this pattern. |
| FP8 training (TransformerEngine) | No speedup (90 vs 89 ms) | At d=512, attention GEMMs are already memory-bound (AI=170-255). FP8 doubles peak FLOPS but also doubles the ridge point, making more ops memory-bound. |
| QKV fusion (8Q/4KV GQA) | 3-17% slower | The fused (512→1024) GEMM is slightly faster, but splitting the output into non-contiguous Q(512)/K(256)/V(256) tensors costs more than the GEMM savings. |
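The FP8 row comes down to roofline arithmetic: an op is memory-bound when its arithmetic intensity (FLOPs per byte of traffic) falls below the hardware ridge point (peak FLOPS ÷ peak bandwidth). A minimal sketch, using approximate H100 SXM peaks (≈989 bf16 tensor-core TFLOPS, ≈3.35 TB/s HBM3) as assumptions; the big 98304×512×1536 GEMM from the table lands above the ridge (hence cuBLAS near its roofline), while FP8 doubles peak FLOPS and therefore doubles the ridge that the AI=170-255 attention GEMMs would have to clear:

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: float) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) GEMM, counting A, B, C traffic once."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic

# Approximate H100 SXM peaks (assumption; check the datasheet for your part).
PEAK_FLOPS_BF16 = 989e12   # dense bf16 tensor-core FLOP/s
PEAK_BW = 3.35e12          # HBM3 bytes/s

ridge_bf16 = PEAK_FLOPS_BF16 / PEAK_BW        # ~295 FLOPs/byte
ridge_fp8 = (2 * PEAK_FLOPS_BF16) / PEAK_BW   # FP8 doubles FLOPS -> ridge doubles

ai_bf16 = gemm_arithmetic_intensity(98304, 1536, 512, bytes_per_elem=2)  # ~382
ai_fp8 = gemm_arithmetic_intensity(98304, 1536, 512, bytes_per_elem=1)   # ~765

for name, ai, ridge in [("bf16", ai_bf16, ridge_bf16), ("fp8", ai_fp8, ridge_fp8)]:
    bound = "compute" if ai > ridge else "memory"
    print(f"{name}: AI={ai:.0f} FLOPs/byte, ridge={ridge:.0f} -> {bound}-bound")
```

An op at AI=255 sits below the bf16 ridge already; under FP8 its intensity at most doubles (to ~510) while the ridge doubles to ~590, so it stays memory-bound, which is the mechanism behind "FP8 makes more ops memory-bound."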

Conclusion: torch.compile (PyTorch 2.9.1) already fuses CE+softcap+tanh, LeakyReLU²+residual, RMSNorm+backward, and all pointwise chains. cuBLAS is at the hardware limit for K=512 (~48% of roofline, limited by pipeline depth). The 82 ms step is 95%+ optimized.
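The fusion targets above are all pointwise chains. As a reference for the kind of chain Inductor collapses into a single kernel, here is a NumPy sketch of the LeakyReLU²+residual pattern (assuming LeakyReLU² denotes the square of a leaky ReLU; the 0.01 slope is a placeholder, not the PR's value):

```python
import numpy as np

def leaky_relu_sq_residual(x: np.ndarray, residual: np.ndarray,
                           slope: float = 0.01) -> np.ndarray:
    """Pointwise chain: leaky_relu(x)**2 + residual.

    Eager PyTorch launches one kernel per op here (where, mul, add);
    torch.compile/Inductor emits one fused Triton kernel for the whole
    chain, so a hand-written fusion can at best tie it.
    """
    y = np.where(x > 0, x, slope * x)  # leaky ReLU
    y = y * y                          # square (note: this drops the sign)
    return y + residual                # residual add

x = np.array([-2.0, 0.5, 3.0])
r = np.ones(3)
print(leaky_relu_sq_residual(x, r))  # values: 1.0004, 1.25, 10.0
```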

torch.compile Gotchas

| Issue | Impact | Mechanism |
| --- | --- | --- |
| Late QAT recompilation | OOM with larger models | Flipping `_qat_enabled` mid-training changes the forward graph → torch.compile recompiles → memory spike exceeds 80 GB. |
| `torch.autograd.Function` | 2-3× slower backward | Custom Functions bypass Inductor entirely; the backward runs as uncompiled eager Python ops. |
| H100 memory compression | 25-50% inflated benchmarks | Synthetic data (cudaMemset, BlockFillRandom, zeros) compresses in HBM hardware. Only `torch.randn` gives real numbers. |
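The memory-compression gotcha generalizes: buffers of zeros or repeated blocks are highly compressible, so any compressed memory path moves far fewer real bytes than the benchmark assumes. HBM compression is transparent hardware, but the same effect can be illustrated with ordinary software compression (the fill patterns below are stand-ins, not the benchmark's actual buffers):

```python
import os
import zlib

n = 1 << 20  # 1 MiB per buffer

zeros = bytes(n)                          # like cudaMemset / torch.zeros
pattern = bytes(range(256)) * (n // 256)  # stand-in for a repeated synthetic block
noise = os.urandom(n)                     # like torch.randn: incompressible

for name, buf in [("zeros", zeros), ("pattern", pattern), ("random", noise)]:
    ratio = len(zlib.compress(buf)) / len(buf)
    print(f"{name:7s} compresses to {ratio:.1%} of original size")
```

Zeros and repeated patterns shrink to well under 1% of their size, while random bytes do not compress at all; a bandwidth benchmark fed the first two kinds of buffer overstates effective throughput accordingly.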

Quantization Experiments (Diminishing Returns)

| Approach | BPB | Delta | Why It Failed |
| --- | --- | --- | --- |
| SpinQuant (Hadamard rotation before GPTQ) | 1.1151 | −0.0002 | GPTQ's actorder + Cholesky already handles outliers; rotation adds little on top. The artifact is slightly larger (rotated weights compress worse). |
| Mixed-precision int5/int8 per-layer | 1.1209 | +0.006 | int5 (31 levels) is too coarse; boundary layers at int8 can't compensate for middle layers losing half their precision. |
| Soft-Round QAT (differentiable rounding) | 1.1151 | −0.0002 | Identical to standard STE: the ~500 QAT steps aren't enough for the temperature annealing to take effect. |
| Selective ±1 pruning at 28-37% | 1.1198-1.1204 | +0.004-0.005 | Too aggressive; only <10% pruning is loss-neutral. |
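The Soft-Round and mixed int5/int8 rows both reduce to the same fake-quantization primitive: scale, round onto a symmetric b-bit integer grid, clamp, rescale. A NumPy sketch (symmetric per-tensor scaling is an assumption for illustration; the PR's actual GPTQ/QAT pipeline is more involved):

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Round w to a symmetric b-bit integer grid and back to float.

    bits=5 gives qmax=15, i.e. 31 levels in [-15, 15]; bits=8 gives 255.
    Training through this op uses the straight-through estimator (STE):
    the forward applies round(), the backward treats round() as identity.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

w = np.array([-1.0, -0.3, 0.02, 0.7, 1.0])
print(fake_quantize(w, bits=5))  # step 1/15: visibly coarse
print(fake_quantize(w, bits=8))  # step 1/127: much finer grid
```

With only 31 levels, every weight lands on a grid with step max|w|/15, which makes the "int5 is too coarse" outcome concrete: the worst-case per-weight error is half that step, versus 1/127 of the range at int8.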

Architecture & Training (All Negative)

| Approach | BPB | Delta | Why It Failed |
| --- | --- | --- | --- |
| XSA on all 11 layers (vs last 4) | worse at 100s | +0.014 | 2.9 ms/step overhead; in our Parallel Muon stack, the slower step costs more than XSA gains. |
| Value Residual Learning | 1.1179 | +0.0008 | VRL conflicts with VE128: both inject identity information into deep attention layers, so it is redundant. |
| Gated Attention | 1.1197 | +0.0026 | 4% slower step time; per-head sigmoid gates add overhead not compensated by quality. |
| Weight decay 0.08 (vs 0.04) | 1.1235 | +0.008 | Better at 100s, worse at 600s: over-regularization prevents learning during warmdown. Early loss does not predict final post-quant BPB. |
| Batch size 1M tokens | 1.1197 | +0.003 | Fewer steps (5,526 vs 7,189) hurt more than better gradients help. |
| Train bigger d=576 + int5 | 1.1233 | +0.006 | 110 ms/step means 24% fewer steps; the scaling-law gain can't compensate. |
| Shard ordering (hard→easy) | 1.1162 | +0.0009 | Per-shard loss spread is only 0.3%; reordering disrupts natural diversity. |
| Legal TTT (22 experiments) | 1.1177 best | +0.0006 | The score-first constraint means the model adapts too late. |
| Hessian all-reduce across GPUs | 1.1169 | −0.0002 | 256 batches per GPU are already sufficient. |

Meta-Lessons

  1. The step is 95%+ optimized. torch.compile handles all fusion, cuBLAS is at the hardware limit, and FA3 is already in use.
  2. H100 is massively overprovisioned for this model: only 21.5 GB of the 80 GB is used, and NVLink is 99% idle.
  3. The competition is bits-per-parameter, not FLOPS-per-second. The quantization gap (0.022 BPB) is 10× larger than any kernel optimization.
  4. Stale processes from nohup+torchrun accumulate silently, causing 2-3× performance degradation.
  5. Early training loss doesn't predict final BPB. Fast A/B tests filter bad ideas but can't confirm good ones.
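Meta-lesson 4 is operational: a `nohup`-launched `torchrun` that dies partway through can leave worker processes pinning GPU memory and SMs, silently slowing the next run. A minimal pre-flight check, sketched here (the `torchrun` match pattern and the kill step are assumptions about the local setup, not part of the PR):

```shell
# Count leftover torchrun launchers before starting a new run.
# pgrep -f matches full command lines; -c prints a count (exit 1 if zero).
stale=$(pgrep -fc "torchrun" || true)
echo "stale torchrun processes: ${stale:-0}"

# If the count is nonzero, clear them before benchmarking, e.g.:
#   pkill -f torchrun
# then confirm with nvidia-smi that GPU memory was actually freed.
```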

Test plan

🤖 Generated with Claude Code

…xH100

30+ experiments on the PR openai#593 stack (1.1171 BPB), all negative or marginal:
- CUTLASS SM90 GEMM: 2.5x slower than cuBLAS
- Fused Triton GEMM+activation: autograd.Function kills backward
- FP8, QKV fusion, custom CUDA: all slower or no improvement
- SpinQuant, mixed int5/int8, Soft-Round QAT: noise-level
- XSA-all, VRL, Gated Attention, bigger model, shard ordering: all worse
- 22 legal TTT experiments: all worse than non-TTT baseline

Key finding: 82ms step is 95%+ optimized. torch.compile handles all fusion.
Competition at d=512 is bits-per-parameter, not FLOPS-per-second.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>